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Abstract 

' Improved EM strategies, based on the idea of efficient data augmentation (Meng and van 

■ Dyk 1997, 1998), are presented for ML estimation of mixture proportions. The resulting 

) ' ! algorithms inherit the simplicity, ease of implementation, and monotonic convergence prop- 

erties of EM, but have considerably improved speed. Because conventional EM tends to be 
slow when there exists a large overlap between the mixture components, we can improve the 
speed without sacrificing the simplicity or stability, if we can reformulate the problem so as 
to reduce the amount of overlap. We propose simple "squeezing" strategies for that purpose. 
Moreover, for high-dimensional problems, such as computing the nonparametric MLE of the 
distribution function with censored data, a natural and effective remedy for conventional EM 
is to add exchange steps (based on improved EM) between adjacent mixture components, 
where the overlap is most severe. Theoretical considerations show that the resulting EM-type 
. algorithms, when carefully implemented, are globally convergent. Simulated and real data 

| examples show dramatic improvement in speed in realistic situations. 

i— l . 

>- ! Keywords: AECM; cocktail algorithm; data augmentation; doubly censored data; EM; global 
convergence; NPMLE; nonparametric mixtures; squeezing; vertex exchange method. 

1 Introduction 

Several statistical problems give rise to a likelihood function formally equivalent to that of a fi- 
nite mixture model with known component densities. One example is maximum likelihood (ML) 
estimation in a saturated multinomial model with ignorable missing data where some units are 
partially classified (Dempster et al. 1977). Another example is nonparametric ML estimation 
(NPMLE) of a mixing distribution (Lindsay 1983) when this distribution is assumed to be sup- 
ported on a finite grid. Closely related is the NPMLE problem for the distribution function for 



censored data (Groeneboom and Wellner 1992; Bohning et al. 1996). The EM algorithm (Demp- 
ster et al. 1977; Meng and van Dyk 1997) is among the simplest and best known methods for 
ML computation in such mixture-like problems; Turnbull (1976) used it on censored data before 
Dempster et al. (1977) laid down the general framework. The potential slow convergence of EM 
in general, and for NPMLE computation in particular, is also well documented. Other methods 
of computing the NPMLE include the iterative convex minorant (ICM) algorithm (Aragon and 
Eberly 1992; Jongbloed 1998), the vertex exchange method (VEM; Bohning et al. 1996), and 
constrained Newton methods (Wang 2008). EM or EM-like algorithms are also widely used for 
related problems such as optimal experimental design (Silvey et al. 1978; Yu 2010a), Poisson im- 
age reconstruction using positron emission tomography (Vardi et al. 1985), and channel capacity 
calculations in Shannon theory (Arimoto 1972; Blahut 1972; Csiszar and Tusnady 1984). 

This paper is concerned with improved EM strategies for maximizing a finite mixture log- 
likelihood with known component densities. Possible extensions to more general problems are 
mentioned in Section 5. Our main motivation is fast computation of the NPMLE for censored 
data. The NPMLE problem is challenging partly because of the high dimension (there are many 
mixture components), and the heavy overlap between components, which slows down conventional 
EM. Our goal is to design algorithms that improve the speed of EM, but preserve its simplicity, 
ease of implementation, and monotonic convergence properties. First, we introduce "squeezing" 
strategies that reformulate the problem so as to reduce the overlap between component densities. 
Such squeezing strategies capitalize on the idea of efficient data augmentation and are inspired by 
Fessler and Hero (1994). The resulting EM algorithms converge faster because they correspond 
to augmented data that are less informative. Secondly, we observe that although "squeezing" 
may not always be effective for the entire collection of mixture components, we can apply it to 
sub-collections that overlap most severely. Adding such EM-based conditional maximization steps 
(nearest neighbor exchanges) can improve the speed dramatically. Overall, our algorithms fit in 
the broad spectrum of alternating-expectation-conditional-maximization (AECM) schemes (Meng 
and van Dyk 1997, 1998). The simplicity and effectiveness of these algorithms testify to the 
advantage of working within the general EM framework (Dempster et al. 1977; Wu 1983; Meng 
and Rubin 1993; Liu and Rubin 1994; Meng and van Dyk 1997; Liu et al. 1998). Also relevant 
is the work of Pilla and Lindsay (2001), who focus on the nonparametric mixture problem and 
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propose pairing nearby components and rotating the pairs for fast ML computation. 

In Section 2, we introduce the squeezing strategies to improve EM for maximizing a mixture log- 
likelihood. Intuitively, squeezing yields an equivalent problem where the mixture components have 
less overlap, and its effect on the speed of EM is explained in terms of efficient data augmentation. 
Section 3 argues that squeezing strategies can be effectively implemented to sub-collections of 
the mixture components. This leads to a "cocktail algorithm" with several different moves that 
complement each other. A global convergence theorem for the cocktail algorithm is proved. A real- 
data example is included as an illustration. Section 4 focuses on the NPMLE problem for censored 
data, and demonstrates the effectiveness of our new algorithms through simulation. Section 5 
concludes with a discussion on possible extensions (to the bivariate interval censoring problem, 
for example). Efficient implementations of our EM- type algorithms, which take advantage of the 
sparsity features of the NPMLE problem for censored data, are collected in the appendix. 

2 EM Algorithms for Mixture Proportions 

Suppose n observations y = (y±, . . . , y n ) are taken from a mixture of m known densities with 
unknown proportions Pi, ■ ■ ■ ,p m - Writing /y as the jth component density evaluated at yi, we 
can express the log- likelihood for p = (pi, . . . ,p m ) as 



2.1 Conventional EM 

Conventional EM introduces latent indicators such that Uj = 1 if the ith observation is from 
component j, and Iij = otherwise. At iteration t, when the current estimate of p is p^ = 
[Pi , . . . ,Pm), the E-step simply computes the conditional expectation of 1^ given observed data 




(2.1) 



We seek to maximize (12. ip over p e where 




and p^: 



E{l %] \y^) 




3 



The M-step then sets ^ = J^ILi ^ (^y ll/> P^) / n - Overall each iteration can be written as 

The EM algorithm maintains monotone increase in the log-likelihood, i.e., Z(p^ +1 ^) > Z(pW). 
Moreover, when started from the interior of the parameter space, i.e., when > for all 
1 < j < m, EM is guaranteed to converge to p, the MLE (Csiszar and Tusnady 1984). Convergence 
is potentially very slow, however, when there exists heavy overlap among the mixture components. 

2.2 Squeezing Strategy I 

To improve conventional EM, let us introduce an auxiliary vector g = (g 1; . . . , g n ) and write the 
objective function (12. ip as 

n / m m \ 

Z(p) =ntog2 + £log £ g iPj /2 + - 9i ) Pj /2 (2.3) 

i=i \j=i j=i J 

n / m \ 

= n log 2 + lo S + - 9i)Pj/2 • (2-4) 

We require ^ > and fij — gi > for all i, i.e., 

< fi-i < min/y. (2.5) 

j 

Conventional EM (12. 2 p can also be derived from (I2.3p . viewing it as a mixture log-likelihood 
with 2m components with proportions Pj/2, j = l,...,m, each appearing twice. Specifically, 
under this new formulation, we let the density of component j (j = 1, . . . , 2m) evaluated at 
observation i be 

fij -9i, 1 < j < m, 

9ii Tn + 1 < j < 2m. 

Let Iij be the latent indicator of whether the ith observation is from component j, j = 1, . . . , 2m. 
The jth component has proportion Pj/2 if 1 < j < m, and pj„ m /2 if m + 1 < j < 2m. Then the 
E-step becomes 



fi 



E { I i3 - 



y ,pW = _ x t ~ (2.6) 

m t l^k=i JikPk \p)- m i m+l<j<2m, 



and the M-step becomes 

n 

pf +1) cxJ2E(k + I hJ+ m\y,P {t) ), j = l,...,m. (2.7) 



8=1 



Routine algebra reveals that the resulting EM iteration is the same as (12. 2p . 

We can also apply EM to maximize (12.41) . viewing it as a problem with m + 1 mixture compo- 
nents, one of which has proportion 1/2. Equivalently, in the above derivation of (12. 6 p and (12. 7p . 
instead of 1^, 1 < i < n, 1 < j ' < 2m, let us treat 

2m 

hi, 1 < i < n, 1 < j < m, and /jo = hj-, 1 < i < 



"5 

j'=m+l 



as the set of latent indicators. The latter, being a collapsed version of the former, contains less 
information about p. Note that Jjo is the indicator of a mixture component with proportion 1/2, 
whose density evaluated at observation i is gi. The E-step proceeds to calculate the conditional 
expectation of Z^-, j — 0, 1, . . . , m, resulting in the same formula as (I2.6P for j = 1, . . . , m. The 
M-step becomes 



(i+i) 
p) oc 



i=l 



J^E (iij Z/,P (t) ), j = l,...,m. (2.* 



The conditional expectation of Uq does not appear in (12. 8p . Combining (12. 6 p (for j = 1, . . . ,m) 
with (I2.8p . we obtain an EM iteration as 



The iteration (12. 2 p corresponds to (12. 9p with g = 0. 

Because (12 .9p is derived in the EM framework, it inherits nearly all the desirable properties of 
(12. 2p . For example, each iteration of (12 .9p increases the log-likelihood (12 .4p . Furthermore, because 
the convergence rate of EM is determined by the fraction of missing information, we know that 
(12. 9p converges faster than (I2.2p because it is based on a reduced set of latent variables. This 
is an example of efficient data augmentation (Meng and van Dyk 1997): we speed up EM by 
augmenting less. This strategy of improving (12. 2 p is called a squeezing strategy because the key 
equivalent formula (12. 4p is obtained by subtracting ( "squeezing out" ) a nonnegative vector g from 
each component density. 
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A slight extension of the above discussion shows that the convergence rate of ( 12. 9ft is monotonic 
in the squeezing parameter g. Because of the restriction (12.51) . the optimal g is therefore the upper 
bound gi = min., /y, that is, we perform as much squeezing as allowed. This optimal g may be 
viewed as the overlap among all component densities. If this overlap is small, i.e., g is close to 
a vector of zeros, then (12.91) is not very different from (12. 2\i . Hence we may expect significant 
speedup only if the overlap is large enough. 

This squeezing strategy (as well as the strategy of Section 2.3) is inspired by the work of Fessler 
and Hero (1994) on efficient EM for Poisson image problems. Similar strategies also work for the 
Arimoto-Blahut algorithm for calculating the Shannon capacity of a discrete memoryless channel 
(see Yu 2010b). 

2.3 Squeezing Strategy II 

The squeezing strategy of Section 2.2 can be improved by further manipulation of the log- 
likelihood. Let us introduce another auxiliary vector — (/?i, . . . , (3 m ), and rewrite (12 .41) as 

/(p) =nlog(2 + /3 + ) + 2^1og 2 + ]^ ^ ' 



i=l 



= nlog(2 + /?+) + 2^ log J - J , (2.11) 

i=i + ' + 

where = — gi as before, and f3 + = Y^T=i Pj- ft is required that 

m 

Pj > 0, 1 < J < m; gi-J2 faPj ^ °> (2-12) 

3=1 

When m = 2 and = min., fy, (12. 12[) is equivalent to 

< p x < min - fi \ , < /3 2 < min - ^ - . (2.13) 

»:/ii>/«2 /a — Ji2 »:/»2>/ii Ji2 — /il 

In general, however, it is not clear how to reduce ( 12.121) to an explicit range for /3. Although we 
do address the choice of (3 in this section, further study is desired. 

Iteration (12. 9p can also be derived from (12.101) . viewing it as a mixture log-likelihood with 
2m + 1 components. The density of the jth component (0 < j < 2m) evaluated at observation i 
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IS 

( 

9i-Y!k=ifikPk, j = o, 

ft = \ fij, l<j<m, 

k fi,j-m, m + 1 < j < 2m. 

The proportion of the jth component is (2 + if j = 0, Pj/(2 + (3 + ) if 1 < j < m, and 

fij-m/ (2 + /?+) if m + 1 < j < 2m. The factor 2 + f3 + makes these proportions sum to one. Let ifj 
denote the indicator of whether observation i is from component j. Similar to Section 2.2, if we 
treat Z^, 1 < i < n, < j < 2m, as latent variables, then the resulting EM iteration is precisely 
(EH}. 

On the other hand, we can derive an EM iteration based on (12. lip , viewing it as a mixture 
log-likelihood with m + 1 components, one of which has proportion (2 + and the others 

have proportions (pj + /3j)/(2 + (3 + ), j — 1, . . . ,m. Equivalently, we treat 

I*, 1 < % < n, and I* + I* +m , 1 < % < n, 1 < j < m, (2.14) 

as latent indicators instead of the entire collection /*, 1 < i < n, < j < 2m. The E-step, as 
before, is to calculate the conditional expectations. We have 

h (p? + Pi) 



ij 1 i,j+m 



y,p J , sr m f (*) ~ r 



The M-step seeks to maximize the function 

m n 

^^/^•logfe + /?,). (2.15) 
j=i i=i 

By checking the Karush-Kuhn- Tucker conditions, it can be shown that (I2.15P is maximized by 
choosing pj as 

= max jo, 5 K ij ~ Pj\ , 1 < 3 < m , ( 2 - 16 ) 

where 5 is determined by the constraint Y^jLiPf^ = 1- Iteration (12. 9 p corresponds to (12 . 1 6[) 
with /3j = 0. 

The new EM iteration (I2.16P is more complicated than (12 .9p or (I2.2p . but only slightly so. 
First, we observe that the right-hand side of (12. 161) is a continuous and increasing function of 5, 
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and hence a 5 exists to ensure Yl^iPf^ = 1, as long as ^ Kij > for some j. Moreover, such a 
S can be found efficiently (in 0(m log m) time) using a "waterfilling" algorithm (see Appendix A 
in Yu 2010b). 

The convergence rate of (12.161) is no worse than that of (12.91) . because the latent variables 
(I2.14p are less informative than the entire collection J*, 1 < i < n, < j < 2m, which lead 
to (12. 9p . A slight extension of this argument reveals that, for fixed g, the convergence rate of 
(I2.16P is monotonic in (3 = (/3 1; . . . , (3 m ). Such results resemble those of Yu (2010b) on improved 
Arimoto-Blahut algorithms for channel capacity calculations. In Yu (2010b), the convergence 
rate comparison results are derived by calculating the matrix rate and analyzing its eigenvalues. 
Because we work within the EM framework, however, the convergence rate comparison results 
here are obtained automatically once we specify the appropriate latent variables. 

Combined with the convergence rate comparisons of Section 2.2, the above considerations 
suggest the following guideline for choosing the squeezing parameters g and (3. 

• Choose gi = minj fy, which satisfies the upper bound in (12.51) . 

• Choose (3j to be as large as possible, subject to (I2.12p . 

When m = 2, the condition (I2.12p reduces to (I2.13p . Hence we recommend setting (3 at the upper 
bounds in (I2.13p . 

It is helpful to write down an explicit formula for (I2.16P with these optimal choices of g and 
(3 in the m = 2 case. Actually, for later convenience, we present a slightly more general iterative 
formula for maximizing (m = 2) 

n 

Z(p) = lo S + /iiPi + /«Pa) > ( 2 - 17 ) 

subject to pj > 0, j = 1, 2, and pi + p 2 = (3 . Here r i5 i — 1, . . . , n, are nonnegative constants, 
and /9 > is fixed. Define g^ = min{/j!, f i2 }, and 

(3 1 = mm — — , (3 2 = mm — — . 

v.fa>fi2 Jii — Ji2 r.fi 2 >fii Ji 2 — Jii 

Suppose the current parameter estimate is p®. We compute 

n „ 

v J i= i n + fnPi + f i2 p 2 
8 



Then we update as 

pf +1) = max{0, min{/3 , (ft + ft + h)S j /{S 1 + S 2 ) - /?,}}, j = 1, 2, (2.18) 

which is a slight generalization of (I2.16P for m = 2. The iteration (" 12 . 18[) is uniquely defined if 
Si + S 2 7^ 0, for which it suffices to have fa — fa ^ 0. (Inspection shows that Pj+/3j > 0, j — 1,2, 
as long as /(p 1 -** 1 ) > — oo.) If /a — fa = 0, then the M-step is not unique, but it is convenient to set 
p(*+i) _ p(t) Because of the max and min operations, ( 12 . 1 8[) can potentially transfer all the mass 
from one component to the other in a single step. This will be especially useful in later sections 
after we introduce nearest neighbor exchanges. 

3 Nearest Neighbor Exchanges and the Cocktail Algo- 
rithm 

3.1 Nearest Neighbor Exchanges 

The intuition that conventional EM tends to be slow when there exists heavy overlap between 
mixture components is used to our advantage in Section 2 for designing faster EM schemes via 
squeezing. However, the strategies so far begin by squeezing out a common vector g from each of 
the components. When there exist many components, it is conceivable that squeezing applied to 
all components may not be effective, even though a sub-collection may have severe overlap. Then 
it is worthwhile to apply some form of "local squeezing" to a sub-collection of components. 

For example, consider a nonparametric mixture problem where the observations yi are assumed 

to be drawn independently from a mixture of normals 

in 

Vi ~ ^PjN(/i i; l). 

3=1 

The variance is fixed for simplicity. We assume the mixing distribution puts mass pj on N(/ij, 1), 
where \ij is obtained by discretizing an interval, say fij = Uj/m, and U > denotes the largest of 
the normal means. As m increases, the overlap between adjacent densities N(/Xj, 1) and N(/i J+1 , 1) 
increases, and conventional EM slows down. The global squeezing strategies may not be effective, 
however, because the overlap between the left-most density N(U/m, 1) and the right-most density 
N(Z7, 1) can still be small. 
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A natural remedy, therefore, is to exchange the mass between each pair of nearby components 
in turn, holding the other components fixed. This is similar to (but somewhat simpler than) 
the paired and rotated EM of Pilla and Lindsay (2001). In general, given the current parameter 
estimate p®, let ji < • ■ ■ < j g +i be the elements of jj : > oj where q + 1 is the number of 
support points of pW. We perform mass exchanges between jk an d jk+i for k — 1, . . . , q in turn, 
i.e., 

P {t+k/q) = VE {j k ,j k+1 , p^-D/*)) , k = 1,. . . ,q. (3.1) 
We use p = VE(u, v , p), u ^ v, to denote an update of the form 



Pi 



Pi, 3 i {u, v}, 
Pj + 5, j = u, 
Pj -5, j = v, 



where 5 G [— p u ,Pv) is chosen so that /(p) > /(p). To choose the step-length 6, we naturally 
use (12.181) . The iteration (12.181) is applicable because, when other components are held fixed, the 
log-likelihood for p u ,p v is exactly in the form of (12.171) . Because (12.181) is an EM iteration, the log- 
likelihood is automatically monotonic. Moreover, (I2.18P is easy to implement, and very amenable 
to theoretical analysis. These all add to the appeal of (I2.18P when compared with standard tools 
such as Newton's method. We refer to the composite mapping pw — > p(*+ 1 ) given by (13. ip as the 
set of nearest neighbor exchanges (NNEs). 

We have found that by adding nearest neighbor exchanges based on d 2 . 1 8 [) to conventional 
EM can lead to considerably improved speed. There is one caveat, however. Conventional EM 
is usually started at the interior of the parameter space, i.e., p^ > for all j, because once a 
component receives zero mass, it does so in all subsequent iterations. By adding nearest neighbor 
exchanges, certain pj may be set to zero. While this has the desirable effect of eliminating bad 
support points, it may accidentally eliminate a good one, and yield a suboptimal solution. The 
problem is easily remedied, however, by adding in the following step, known as the vertex direction 
method (VDM; Fedorov 1972). Given the current parameter estimate p, we first calculate the 
derivatives 

3 dPj i^t Vi ' 
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m, is maximized. Then 



(3.2) 



where 6 G [0, 1] is chosen so that Z(p) > /(p). We use iteration (I2.18P for choosing S because Z(p) 
as a function of (5, 1 — 5) is again in the form of (j2.17|) . 

Let us denote the mapping (13. 2p with 5 chosen by (I2.18P as p = VDM(j#, p). In Section 3.2 we 
show that VDM based on (I2.18p . when added to conventional EM and nearest neighbor exchange 
iterations, results in a globally convergent algorithm. That is, starting from any p(°) G O, all limit 
points of the resulting algorithm are global maxima of the log-likelihood function on G. The proof 
actually shows that by adding VDM to any monotonic algorithm we obtain a globally convergent 
algorithm. 



We summarize a "cocktail algorithm" based on VDM, nearest neighbor exchanges, and EM steps. 
A convergence proof is then provided. Empirical evaluation of such a strategy is presented in 
Section 3.3. Yu (2009) applies this strategy to the D-optimal design problem, and reports dramatic 
improvement in speed. We show similar performance for the mixture problem. 

Cocktail Algorithm 

1 At iteration t, first perform a VDM step (13. 2p where 5 is chosen using (I2.18p . 

2 Then use the output of VDM, say p, as input for the nearest neighbor exchanges, i.e., (13. ip . 

again based on (I2.18p . 

3 Finally, update the output of ( 13. ip using ( I2.2p . i.e., conventional EM, to obtain the next iterate 



Note that this is only one of the potential algorithms based on reducing the overlap between 
component densities. There is much room for further exploration. One could consider, for example, 
an algorithm that uses only Steps 1 and 2 above at each iteration. That is, a VDM step is combined 



3.2 The Cocktail Algorithm 



P 
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with nearest neighbor exchanges. We call this algorithm NNE+. We design the cocktail algorithm 
in the hope that the nearest neighbor steps and conventional EM can complement each other, 
since NNE+ focuses on purely local modifications, whereas EM focuses on purely global ones. As 
we shall illustrate in empirical examples, the performance of NNE+ or conventional EM (each by 
itself) can be poor, but the cocktail algorithm is very fast. 

Because the cocktail algorithm consists of many EM sub-steps, the log-likelihood is guaranteed 
to increase at each iteration. Further analysis yields the following convergence theorem. 

Theorem 1. The cocktail algorithm is globally convergent. That is, ifp® is a sequence generated 
by the cocktail algorithm starting from any p(°) E O, then all limit points ofp^' are global maxima 
of Z(p) on p G ©. 

Proof. The proof is similar to that of Theorem 1 in Yu (2009). Let denote the output of the 
VDM step at iteration t. By monotonicity, 

Z(pW) < /(p (t) ) < /( P (m) ). 

Hence the two sequences Z(p^) and Z(pW) tend to the same (finite) limit. Let p* be a limit 
point of p^, and let p^ be a subsequence converging to p*. Without loss of generality, we 
may assume that the VDM steps p^^ — > p^' are all performed on the same index k = j# as 
in ( 13. 2D . since at least one of the m indices will appear infinitely often. If = fijPj f° r 
some p = p^ or p = p*, then we can show directly that dl(p)/dpk = n. By the choice of 
k, we have dl(p)/dpj < n for all 1 < j < m, and hence p is already a global maximum by 
the general equivalence theorem (Lindsay 1983). Assume fn- ^ fijPj f° r ah P — an d 
p = p*. Then inspection of (13. 2p and ( 12. 18ft reveals that all VDM steps p^ — > p^ tj ^ are uniquely 
defined. Moreover, when k is considered fixed, the VDM mapping is continuous at p*. Hence p^^ 
converges to VDM(k, p*) = p, say, and Z(p) = Z(p*) as a result. Since this VDM step p* — > p is 
derived in the EM framework, p being uniquely defined means that it is the unique maximizer at 
the M-step. If p ^ p*, then the expected complete-data log-likelihood increases strictly, and so 
does the observed log-likelihood, which contradicts Z(p) = Z(p*). It follows that p = p*, i.e., p* is 
a fixed point of the VDM mapping. Inspection of ( 13 .21) and ( 12.181) . however, shows that this fixed 
point must satisfy dl(p*)/dpk < n, which implies that p* is a global maximum. □ 
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3.3 Numerical Illustration 

This section gives a numerical illustration of the effectiveness of the cocktail algorithm. The 
cocktail algorithm is compared with conventional EM, i.e., iteration (12. 2)) , the algorithm NNE+ 
mentioned in Section 3.2, and the vertex exchange method (VEM) of Bohning (1985). To describe 
VEM, suppose the parameter estimate at iteration t is p^. Define dj = dl (p^) /dpj, and let j# 
and j# be indices between 1 and m such that 

din = min gL, gL# = max cL. 

j:pV>Q r m 

VEM sets p (m) as 

P ^ = VE(j*, M ,pV). 

That is, we exchange the mass between the indices j and j# so as to increase the log-likelihood. 
We again employ ( 12 . 18[) to choose the step-length. 

We run each of (conventional) EM, NNE+, VEM, and the cocktail algorithm until convergence 
and record the number of iterations and computing time. An iteration of NNE+ consists of 
one iteration of VDM and the set of nearest neighbor exchanges. An iteration of the cocktail 
algorithm consists of one iteration each of NNE+ and conventional EM. We shall concentrate on 
the computing time as a more objective measure of performance. All calculations are performed 
on the same Sun Solaris 10 machine, and the computing time is recorded using the R function 
system.timeQ. The program is written in C and is available, together with the R interface, upon 
request from the author. 

Each algorithm is started from the same uniform probability vector, i.e., p^ = 1/m, j = 
1, . . . , m. We use the common convergence criterion 

max dj — n < e, (3-3) 

l<j<m 

where dj = dl(p)/dpj. The theoretical basis for (13. 3ft is that for any p that satisfies this criterion 
we have 

Kp) - Kp) < e, 

where p is the MLE (Lindsay 1983, Bohning et al. 1996). We choose e = 10 -6 in our experiments. 

EM, NNE+, VEM, and the cocktail algorithm are tested on data taken from Roeder (1990) 
concerning the velocities of 82 galaxies. Following Pilla and Lindsay (2001), we fit a normal finite 
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Table 1: Iteration count and computing time (in seconds) until convergence for four algorithms 
on the galaxy data. 





Iteration count 




Computing time 


EM 


NNE+ VEM Cocktail 


EM 


NNE+ VEM Cocktail 


21777 


74 974 36 


87 


0.02 0.13 0.02 



mixture model to these data. The means of the normal components lie on a grid of 64 equally- 
spaced points from 10.0 to 33.94, and the common standard deviation is a — 0.95. The algorithms 
deliver the same MLE as reported by Pilla and Lindsay (2001), and their performance is recorded 
in Table 1. 

Clearly the nearest neighbor exchanges are very effective, since both NNE+ and the cocktail 
algorithm improve conventional EM dramatically, reducing its computing time by orders of mag- 
nitude. Adding conventional EM to NNE+ appears to have increased the computing time per 
iteration, but decreased the number of iterations, so that the cocktail algorithm and NNE+ have 
similar overall computing time. We remark that the cocktail algorithm is also appealing because 
it is easy to implement and requires virtually no tuning. 

4 Efficient EM for Computing the NPMLE for Censored 
Data 

We show that the EM strategies designed in Sections 2 and 3, in particular the cocktail algorithm of 
Section 3, are also effective for the NPMLE problem for censored data. Section 4.1 briefly reviews 
how this problem can be viewed as a problem of mixture proportions. Section 4.2 highlights fast 
implementations of our algorithms. Section 4.3 contains numerical illustrations using simulated 
data. 
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4.1 NPMLE for Censored Data 

Assume that failure time data collected from n units are independent and identically distributed 
according to a distribution function F, except that they are subject to censoring. Following 
Gentleman and Geyer (1994), assume there is an inspection time process Q which is independent 
of the failure times, and suppose each unit is subject to inspections governed by Q independently. 
The observed data then consist of n observation intervals (k,ri], where k is the last inspection 
time prior to failure and is the first inspection time after failure for subject i. Right censoring 
may be represented by r« = oo, and left censoring by k = 0; the exact failure time is observed 
when li coincides with i.e., when the individual is subject to continuous inspections. 

Doubly censored data arise when units are subject to both right and left censoring. The 
inspection time process is given by a pair of random variables (L, U), L < U . If the failure time 
T satisfies L < T < U, then T is observed; if T < L, then the observation is left censored at L; 
if T > U, then the observation is right censored at U. For (case 2) interval censored data, the 
failure time T is not observed, but only known to fall within a random interval. 

In either case, let us order the distinct observation times (i.e., end points of the intervals (/j, r^]) 
as = z < z\ < . . . < z m -i < z m = oo. Denote 

P = (pi, • • ■ ,p m ), Pj = F ( z j) ~ F ( z j-i), 
and define the n x m matrix (/^-) by 

J 1, Zj G (li, ri\, 
I 0, otherwise. 

For notational convenience we assume that observation intervals are open at the left, and closed 
at the right, end points, though extension to the general case is trivial; if the ith observation is 
exact, for example, we simply set /y = 1 if Zj = U{= r») and /y = otherwise. Seeking the 
maximizer F that jumps only at observed time points, we note that the log-likelihood function is 
exactly ( 12. ip . Thus all the EM-type algorithms in Sections 2 and 3 can be applied to compute the 
NPMLE, defined as the p that maximizes ( 12. ip . 
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4.2 Efficient Implementations 

Straightforward implementation of conventional EM, NNE+, VEM, or the cocktail algorithm, as 
described in Sections 2 and 3, each requires 0(nm) time per iteration. For univariate censored 
data, however, one can take advantage of the special structure of the matrix (f^) to derive 0(n) 
implementations. This is a substantial reduction since m is often of the same order of magnitude 
as n. It is especially relevant in situations such as bootstrap resampling, when repeated use of the 
algorithms is needed. 

Noted only briefly by Jongbloed (1998), who does not give the technical details concerning 
computational complexity, this possibility of fast EM implementation has remained largely un- 
noticed. Zhang and Jamshidian (2004) present fast implementations for doubly censored data, 
but not (case 2) interval censored data. In Appendix A, we show that all four algorithms admit 
efficient implementations for univariate interval censoring in general. 

4.3 Evaluation of Algorithms on Doubly Censored Data 

In this section we evaluate the effectiveness of EM, NNE+, VEM, and the cocktail algorithm for 
computing the NPMLE for doubly censored data. Simulations are performed under conditions 
similar to those of Wellner and Zhan (1997), and Zhang and Jamshidian (2004). Wellner and 
Zhan (1997) propose an effective algorithm which combines EM and iterative convex minorant 
(ICM; Jongbloed 1998) iterations. We have decided to focus on evaluating the cocktail algorithm 
relative to EM, NNE+ and VEM because they are easy to describe and easy to implement. A full 
evaluation, including the case of bivariate censoring, is work in progress. 

Our simulation setting is as follows. The failure time Tj for unit i is generated as an independent 
exponential random variable with mean 1, and the censoring variables Lj, Ui are generated as the 
(ftth and g 2 th order statistics of 20 independent uniform(0, 1) random variables. The resulting 
observation is 

Tj, if Li <T { < Ui] 
(0,Li], if T t <L i; 
(U u oo], if Ti>Ui. 
By adjusting qi and q 2 we obtain varying degrees of censoring. 

As in Section 3.3, all algorithms are started at the uniform probability vector, and the com- 
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Table 2: Means and standard deviations of the number of iterations and computing time (in 
seconds) until convergence for four algorithms for doubly censored data. Input data are generated 
with qi = 3 and q 2 = 18. 







Iteration count 






Compui 


;ing time 




EM 


in in 


V Huvl 


Cocktail 


EM 


NNE+ 


v n/ivi 












I v — X WWW 










mean 


5076 


15456 


5714 


46.2 


3.22 


20.5 


0.80 


0.07 


s.d. 


194 


488 


206 


7.7 


0.20 


1.2 


0.04 


0.01 










n = 2000 










mean 


9920 


44044 


12008 


67.3 


17.0 


157 


3.77 


0.28 


s.d. 


892 


1645 


269 


7.3 


2.2 


9 


0.15 


0.08 










n = 4000 










mean 


20347 


100000+ 


24924 


93.3 


85.6 


740+ 


19.1 


0.77 


s.d. 


1056 




369 


7.9 


5.2 




0.7 


0.06 



mon convergence criterion is ( 13.31) with e = 10 -6 . Our limited experience suggests that VEM 
may benefit from a starting value with fewer support points, but the cocktail algorithm is rel- 
atively insensitive to the initial number of support points. Again, an iteration of the cocktail 
algorithm consists of an iteration of VDM, the set of nearest neighbor exchanges, and an iteration 
of conventional EM. 

Based on 10 replications, Tables 2 and 3 display the means and standard deviations of the 
number of iterations and computing time (in seconds) until convergence for EM, NNE+, VEM 
and the cocktail algorithm. The input data are generated with qi = 3 and q 2 = 18 (moderate 
censoring) for Table 2, and with qi = 8 and q 2 = 12 (heavy censoring) for Table 3. 

In either situation we see that the cocktail algorithm is a dramatic improvement; it reduces 
the computing time of (conventional) EM or NNE+ by large factors, the reduction being more 
significant as n, the number of units, becomes larger. The improvement is especially remarkable 
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Table 3: Means and standard deviations of the iteration count and computing time (in seconds) 
until convergence for four algorithms for doubly censored data. Input data are generated with 
qi — 8 and q2 = 12. 



Iteration count Computing time 





EM 


NNE+ 


VEM 
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EM 


NNE+ 


VEM 
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n = 1000 










mean 


5793 


2768 


2170 


65.3 


7.77 


1.70 


0.38 


0.05 


s.d. 


877 


421 


170 


8.3 


1.76 


0.34 


0.11 


0.01 










n = 2000 










mean 


11034 


8669 


4411 


103 


38.0 


13.0 


1.56 


0.20 


s.d. 


2022 


491 


192 


8 


8.3 


1.0 


0.08 


0.01 










n = 4000 










mean 


20397 


27247 


9176 


145 


163 


89.8 


8.76 


0.61 


s.d. 


5481 


1957 


336 


17 


51 


8.8 


0.57 


0.08 
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because the cocktail algorithm is a direct combination of EM and NNE+, each of which is very 
slow. NNE+ performs much worse for moderately censored data than for heavily censored data. 
In the case of q\ = 3, q 2 = 18 and n = 4000, each of the 10 runs of NNE+ takes more than 
100000 iterations. As expected, each algorithm takes more iterations as n increases. Somewhat 
unexpectedly, for the same n, EM has similar iteration counts for moderately versus heavily 
censored data. Yet the computing time of EM in the heavily censored case is significantly higher. 
Another peculiarity is that, for the same n, VEM takes fewer iterations and less time for heavily 
censored data than for moderately censored data. But the main feature in Tables 2 and 3 is the 
clear superiority of the cocktail algorithm in this example. 

5 Discussion 

We have shown how to use efficient data augmentation to design fast EM-type algorithms for 
maximizing a mixture log-likelihood with known component densities. Squeezing strategies are 
presented that take advantage of the overlap between components. A cocktail algorithm that 
combines conventional EM with a nearest neighbor exchange strategy is found to perform very 
well for computing the NPMLE for censored data, which is the intended application area of this 
work. 

The nearest neighbor exchange strategy works well with conventional EM when there is a 
natural ordering of the mixture components, as in the case of univariate censored data. It would 
be interesting to extend such algorithms to bivariate censoring (Betensky and Finkelstein 1999), 
where we observe a pair of possibly censored random variables for each unit. Bivariate censoring 
presents many inferential and computational challenges. Work on extending the effective nearest 
neighbor strategy is in progress, with encouraging preliminary results. Extensions that accommo- 
date truncation in addition to censoring, or that facilitate semi-parametric estimation, would also 
be desirable. 

It would be interesting to extend the squeezing strategies of Section 2 to mixture problems 
with unknown parameters in the component densities. One approach is to again adopt AECM 
(Meng and van Dyk 1997), and perform two types of EM-based maximization steps, one for 
the mixture proportions given the other parameters, and one for the other parameters given the 
mixture proportions. The squeezing strategies can be used at the former maximization step. It 
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would be worthwhile to investigate the potential gain of using such strategies. 
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Appendix A: Fast Implementations for the NPMLE Prob- 
lem with Censored Data 

The conventional EM mapping, (I2.2p . contains a summation of n terms for each j = 1, . . . ,m. 
However, one can take advantage of the special structure of the matrix (fy) for fast computation. 
This is a zero-one matrix whose non-zero entries are consecutive in each row. Equivalently, if we 
let 

a(i) = min{j : zj G (k, }-*]}, b(i) = max{j : zj G (k, r»]}, i = 1, . . . , n, (.1) 

then fij = 1 if a(i) < j < b(i) and = otherwise. For convenience, we drop the superscripts in 
( 12.21) and focus on how to efficiently compute 

for any p = (p l3 . . . ,p m ) G 6. Note that computing ^ = Y^k=i fakVki i — 1, ■ ■ ■ ,n, can be done in 
0(m + n) time (or equivalently 0(n) time because m < 2n + 1) using Algorithm 1. 

Algorithm 1 

Step 1 Calculate the cumulative sums Sj = Ylk=iPk, j — 1> • • • ,m- Set s = 0. 
Step 2 Set rji = s b ^ - s a(i )_i, i = l,...,n. 
If we can calculate 

n 

*i = Y,f*lm= E V»fc. (-2) 

i=l i: a(i)<j<b(i) 

then p™ ew = djPj/n. To calculate dj efficiently, we rely on the following algorithm. 
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Algorithm 2 

Step 1 Initialize d 3 ■ = 0, j — 1, . . . , m. 

Step 2 For i = 1, . . . ,n, add 1/77^ to d a ^, and, if b{i) + 1 < m, subtract 1/rji from d b ^ + i. 
Step 3 Replace (di, . . . , d m ) by its cumulative sum, i.e., for j = 2, . . . , m, add dj-i to o?j. 

Obviously, Algorithm 2 costs O(n) time. We have 
Proposition 1. Algorithm 2 is valid, i.e., its output agrees with Oj) . 

Proof. This can be shown by induction. First, in the output of Algorithm 2, d\ = 5^*-o(i)=i V 7 ^ 
which agrees with (TJ2]). Assume Algorithm 2 gives the correct answer for dj-i, j > 1. Then 
Algorithm 2 computes dj using 

i: a(i)=j i: b(i)+l=j 

which again agrees with ([2]) if we consider the difference dj — dj_\. By the induction principle, 
all dj, j = 1, . . . , m, are correctly computed. □ 

Algorithms 1 & 2 clearly give an 0(n) implementation of a conventional EM iteration. Because 
rji, i — 1, . . . , n and dj, j = 1, . . . , m, are also the key quantities for VDM and VEM, the same 
efficient implementation applies to VDM and VEM. Specifically, the underlying iteration ( 12. 181) is 
done in 0(n) time by keeping track of rji. 

For NNE+ and the cocktail algorithm, we notice that each sub-step of nearest neighbor ex- 
change in (13. II) affects only a limited number of terms in the log-likelihood. Specifically, to imple- 
ment sub-step k, only observation intervals (/j, r^] such that either a(i) < jk < b(i) and jk+i > b{i), 
or jk < a(i) and a{i) < jk+i < b{i), need be considered. Define the set 

V k = {i : a(i) < j k < b(i), j k+1 > b{i)} U {i : j k < a(i), a(i) < j k+ i < b(i)}. 

The time cost of sub-step k is proportional to \ V k \, the number of entries in V k . However, because 
each i belongs to at most two of V k , k = 1, . . . , q, the total number of entries satisfy J2l=i 1^*1 — 
Hence, with a bit of bookkeeping, the entire set of nearest neighbor exchanges can be implemented 
in 0(n) time per iteration. 
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Remark. Algorithms 1 and 2 take a(i), b(i), i = 1, . . . , n, given by Oj) . as input. Setting 
these up requires sorting the end points of the observation intervals r$], which costs O(nlogn) 
time. Although slightly higher than the 0(n) per-iteration cost, this is typically a small fraction of 
the total computing time because of the required number of iterations. Setting up the full matrix 
(fij), on the other hand, costs 0(mn) time, which is 0(n 2 ) in the worst case. 
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